We study the problem of releasing $k$-way marginals of a database $D \in (\{0,1\}^d)^n$, while
preserving differential privacy. The answer to a $k$-way marginal query is the fraction of $D$'s
records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal
queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms
for privately releasing marginal queries has been identified as an important open problem in
private data analysis (cf. Barak et al., PODS '07).

We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases a private summary capable
of answering any $k$-way marginal query with at most $\pm .01$ error on every query as long as
$n \geq d^{O(\sqrt{k})}$. To our knowledge, ours is the first algorithm capable of privately releasing marginal
queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the
number of $k$-way marginal queries, which is $d^{\Theta(k)}$ (for $k \ll d$).



1 Introduction 



Consider a database $D \in (\{0,1\}^d)^n$ in which each of the $n = |D|$ rows corresponds to an individual's
record, and each record consists of d binary attributes. The goal of privacy-preserving data analysis 
is to enable rich statistical analyses on the database while protecting the privacy of the individuals. 
In this work, we seek to achieve differential privacy [6], which guarantees that no individual's data 
has a significant influence on the information released about the database. 



*An extended abstract of this work appears in ICALP '12 [22].
One of the most important classes of statistics on a dataset is its marginals. A marginal query
is specified by a set $S \subseteq [d]$ and a pattern $t \in \{0,1\}^{|S|}$. The query asks, "What fraction of
the individual records in $D$ has each of the attributes $j \in S$ set to $t_j$?" A major open problem
in privacy-preserving data analysis is to efficiently create a differentially private summary of the
database that enables analysts to answer each of the $3^d$ marginal queries. A natural subclass of
marginals are $k$-way marginals, the subset of marginals specified by sets $S \subseteq [d]$ such that $|S| \leq k$.
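To make the definition concrete, the following minimal sketch evaluates a marginal query on a toy database. The function name `marginal`, the dictionary encoding of the pair $(S, t)$, and the data are illustrative assumptions and do not come from the paper.

```python
from typing import Dict, Sequence


def marginal(db: Sequence[Sequence[int]], pattern: Dict[int, int]) -> float:
    """Fraction of rows whose attribute j equals pattern[j] for every j in S.

    `pattern` maps each column index j in S to the required bit t_j.
    """
    if not db:
        raise ValueError("database must contain at least one row")
    hits = sum(all(row[j] == v for j, v in pattern.items()) for row in db)
    return hits / len(db)


# Toy database: n = 4 rows, d = 3 binary attributes (made-up data).
toy_db = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

# 2-way marginal with S = {0, 1} and t = (1, 1): the second and fourth rows match.
two_way = marginal(toy_db, {0: 1, 1: 1})
# The empty pattern (S = {}) is matched by every row.
trivial = marginal(toy_db, {})
```

Each attribute is either required to be 0, required to be 1, or left out of $S$, which is where the count of $3^d$ marginal queries comes from.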

Privately answering marginal queries is a special case of the more general problem of privately
answering counting queries on the database, which are queries of the form, "What fraction of
individual records in $D$ satisfy some property $q$?" Early work in differential privacy [5], [2] showed
how to approximately answer any set of counting queries $Q$ by perturbing the answers with
appropriately calibrated noise, providing good accuracy (say, within $\pm .01$ of the true answer) as
long as $|D| \gtrsim |Q|^{1/2}$.

In a setting where the queries arrive online, or are known in advance, it may be reasonable
to assume that $|D| \gtrsim |Q|^{1/2}$. However, many situations necessitate a non-interactive data release,
where the data owner computes and publishes a single differentially private summary of the database
that enables analysts to answer a large class of queries, say all $k$-way marginals for a suitable
choice of $k$. In this case $|Q| = d^{\Theta(k)}$, and it may be impractical to collect enough data to ensure
$|D| \gtrsim |Q|^{1/2}$. Fortunately, the remarkable work of Blum et al. [3] and subsequent refinements
[19, 14, 13, 12] have shown how to privately release approximate answers to any set of counting
queries, even when $|Q|$ is exponentially larger than $|D|$. For example, these algorithms can release
all $k$-way marginals as long as $|D| \geq \tilde{\Theta}(k\sqrt{d})$. Unfortunately, all of these algorithms have running
time at least $2^d$, even when $Q$ is the set of 2-way marginals (and this is inherent for algorithms
that produce "synthetic data" [23], as discussed below).

Given this state of affairs, it is natural to seek efficient algorithms capable of privately releasing
approximate answers to marginal queries even when $|D| \ll d^k$. A recent series of works [11, 4, 15]
have shown how to privately release answers to $k$-way marginal queries with small average error
(over various distributions on the queries) with both running time and minimum database size much
smaller than $d^k$ (e.g. $d^{O(1)}$ for product distributions [11, 4] and $\min\{d^{O(\sqrt{k})}, d^{O(d^{1/3})}\}$ for arbitrary
distributions [15]). Hardt et al. [15] also gave an algorithm for privately releasing $k$-way marginal
queries with small worst-case error and minimum database size much smaller than $d^k$. However,
the running time of their algorithm is still $d^{\Theta(k)}$, which is polynomial in the number of queries.

In this paper, we give the first algorithms capable of releasing $k$-way marginals up to small
worst-case error, with both running time and minimum database size substantially smaller than
$d^k$. Specifically, we show how to create a private summary in time $d^{O(\sqrt{k})}$ that gives approximate
answers to all $k$-way marginals as long as $|D|$ is at least $d^{O(\sqrt{k})}$. When $k = d$, our algorithm runs
in time $2^{\tilde{O}(\sqrt{d})}$, and is the first algorithm for releasing all marginals in time $2^{o(d)}$.

1.1 Our Results and Techniques 

In this paper, we give faster algorithms for releasing marginals and other classes of counting queries. 

Theorem 1.1 (Releasing Marginals). There exists a constant $C$ such that for every $k, d, n \in \mathbb{N}$
with $k \leq d$, every $\alpha \in (0,1]$, and every $\varepsilon > 0$, there is an $\varepsilon$-differentially private sanitizer that,
on input a database $D \in (\{0,1\}^d)^n$, runs in time $|D| \cdot d^{C\sqrt{k}\log(1/\alpha)}$ and releases a summary that
enables computing each of the $k$-way marginal queries on $D$ up to an additive error of at most $\alpha$,
provided that $|D| \geq d^{C\sqrt{k}\log(1/\alpha)}/\varepsilon$.
For notational convenience, we focus on monotone $k$-way disjunction queries. However, our results
extend straightforwardly to general non-monotone $k$-way disjunction queries (see Section 4.1),
which are equivalent to $k$-way marginals. A monotone $k$-way disjunction is specified by a set $S \subseteq [d]$
of size $k$ and asks what fraction of records in $D$ have at least one of the attributes in $S$ set to 1.

Our algorithm is inspired by a series of works reducing the problem of private query release to
various problems in learning theory. One ingredient in this line of work is a shift in perspective
introduced by Gupta, Hardt, Roth, and Ullman [11]. Instead of viewing disjunction queries as a
set of functions on the database, they view the database as a function $f_D : \{0,1\}^d \to [0,1]$, in which
each vector $s \in \{0,1\}^d$ is interpreted as the indicator vector of a set $S \subseteq [d]$, and $f_D(s)$ equals
the evaluation of the disjunction specified by $S$ on the database $D$. They use the structure of the
functions $f_D$ to privately learn an approximation $g_D$ that has small average error over any product
distribution on disjunctions.

Cheraghchi, Klivans, Kothari, and Lee [4] observed that the functions $f_D$ can be approximated
by a low-degree polynomial with small average error over the uniform distribution on disjunctions.
They then use a private learning algorithm for low-degree polynomials to release an approximation
to $f_D$, and thereby obtain an improved dependence on the accuracy parameter, as compared to [11].

Hardt, Rothblum, and Servedio [15] observe that $f_D$ is itself an average of disjunctions (each
row of $D$ specifies a disjunction of bits in the indicator vector $s \in \{0,1\}^d$ of the query), and thus
develop private learning algorithms for threshold of sums of disjunctions. These learning algorithms
are also based on low-degree approximations of sums of disjunctions. They show how to use their
private learning algorithms to obtain a sanitizer with small average error over arbitrary distributions
with running time and minimum database size $d^{O(\sqrt{k})}$. They then are able to apply the private
boosting technique of Dwork, Rothblum, and Vadhan [9] to obtain worst-case accuracy guarantees.
Unfortunately, the boosting step incurs a blowup of $d^k$ in the running time.

We improve the above results by showing how to directly compute (a noisy version of) a polynomial
$p_D$ that is privacy-preserving and still approximates $f_D$ on all $k$-way disjunctions, as long
as $|D|$ is sufficiently large. Specifically, the running time and the database size requirement of
our algorithm are both polynomial in the number of monomials in $p_D$, which is $d^{O(\sqrt{k})}$. By "directly",
we mean that we compute $p_D$ from the database $D$ itself and perturb its coefficients, rather
than using a learning algorithm. Our construction of the polynomial $p_D$ uses the same low-degree
approximations exploited by Hardt et al. in the development of their private learning algorithms.

In summary, the main difference between prior work and ours is that prior work used learning 
algorithms that have restricted access to the database, and released the hypothesis output by the 
learning algorithm. In contrast, we do not make use of any learning algorithms, and give our release 
algorithm direct access to the database. This enables our algorithm to achieve a worst-case error 
guarantee while maintaining a minimal database size and running time much smaller than the size 
of the query set. Our algorithm is also substantially simpler than that of Hardt et al.

We also consider other families of counting queries. We define the class of r-of-k queries. Like
a monotone $k$-way disjunction, an r-of-k query is defined by a set $S \subseteq [d]$ such that $|S| \leq k$. The
query asks what fraction of the rows of $D$ have at least $r$ of the attributes in $S$ set to 1. For $r = 1$,
these queries are exactly monotone $k$-way disjunctions, and r-of-k queries are a strict generalization.
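As a small illustration of this definition, the sketch below evaluates r-of-k queries on a made-up database and recovers the monotone disjunction as the special case $r = 1$. The helper names `r_of_k` and `disjunction` are hypothetical and chosen only for this example.

```python
from typing import Sequence, Set


def r_of_k(db: Sequence[Sequence[int]], subset: Set[int], r: int) -> float:
    """Fraction of rows with at least r of the attributes in `subset` set to 1."""
    if not db:
        raise ValueError("database must contain at least one row")
    hits = sum(1 for row in db if sum(row[j] for j in subset) >= r)
    return hits / len(db)


def disjunction(db: Sequence[Sequence[int]], subset: Set[int]) -> float:
    """A monotone disjunction is the r-of-k query with r = 1."""
    return r_of_k(db, subset, 1)


# Toy database with d = 4 attributes (made-up data).
toy_db = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
S = {0, 1, 2}

at_least_one = disjunction(toy_db, S)  # three of the four rows qualify
at_least_two = r_of_k(toy_db, S, 2)    # only the last two rows qualify
```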

Theorem 1.2 (Releasing r-of-k Queries). For every $r, k, d, n \in \mathbb{N}$ with $r \leq k \leq d$, every $\alpha \in (0,1]$,
and every $\varepsilon > 0$, there is an $\varepsilon$-differentially private sanitizer that, on input a database $D \in (\{0,1\}^d)^n$,
runs in time $|D| \cdot d^{\tilde{O}(\sqrt{rk\log(1/\alpha)})}$ and releases a summary that enables computing each of the r-of-k
queries on $D$ up to an additive error of at most $\alpha$, provided that $|D| \geq d^{\tilde{O}(\sqrt{rk\log(1/\alpha)})}/\varepsilon$.

Footnote: In their learning algorithm, privacy is defined with respect to the rows of the database $D$ that defines $f_D$, not
with respect to the examples given to the learning algorithm (unlike earlier works on "private learning" [16]).

Note that monotone $k$-way disjunctions are just r-of-k queries where $r = 1$, thus Theorem 1.2
implies a release algorithm for disjunctions with quadratically better dependence on $\log(1/\alpha)$, at
the cost of slightly worse dependence on $k$ (implicit in the switch from $O(\cdot)$ to $\tilde{O}(\cdot)$).

Finally, we present a sanitizer for privately releasing databases in which the rows of the database
are interpreted as decision lists, and the queries are inputs to the decision lists. That is, instead of
each record in $D$ being a string of $d$ attributes, each record is an element of the set $\mathrm{DL}_{k,m}$, which
consists of all length-$k$ decision lists over $m$ input variables. (See Section 4.3 for a precise definition.)
A query is specified by a string $y \in \{0,1\}^m$ and asks "What fraction of database participants would
make a certain decision based on the input $y$?"

As an example application, consider a database that allows high school students to express their 
preferences for colleges in the form of a decision list. For example, a student may say, "If the school 
is ranked in the top ten nationwide, I am willing to apply to it. Otherwise, if the school is rural, 
I am unwilling to apply. Otherwise, if the school has a good basketball team then I am willing to 
apply to it." And so on. Each student is allowed to use up to k attributes out of a set of m binary 
attributes. Our sanitizer allows any college (represented by its m binary attributes) to determine 
the fraction of students willing to apply. 
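The college example can be sketched in a few lines. The encoding below (a rule is a triple of attribute index, triggering value, and output bit, followed by a default output) is an assumption made for illustration; the precise definition of $\mathrm{DL}_{k,m}$ is given in Section 4.3 and may differ in detail. The attribute names and the two student records are made up.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# (attribute index, attribute value that fires the rule, output bit)
Rule = Tuple[int, int, int]


@dataclass(frozen=True)
class DecisionList:
    rules: Tuple[Rule, ...]
    default: int

    def decide(self, y: Sequence[int]) -> int:
        """Return the output of the first rule that fires on input y."""
        for index, value, output in self.rules:
            if y[index] == value:
                return output
        return self.default


def fraction_willing(db: Sequence[DecisionList], y: Sequence[int]) -> float:
    """Counting query in the dual setting: rows are decision lists, y is the query."""
    return sum(student.decide(y) for student in db) / len(db)


TOP_TEN, RURAL, BASKETBALL = 0, 1, 2  # m = 3 hypothetical college attributes

# "If top ten, apply. Otherwise, if rural, do not apply. Otherwise, if good
# basketball team, apply." The default (do not apply) is an added assumption.
student_a = DecisionList(((TOP_TEN, 1, 1), (RURAL, 1, 0), (BASKETBALL, 1, 1)), default=0)
# A second student who only rules out rural schools.
student_b = DecisionList(((RURAL, 1, 0),), default=1)

db = [student_a, student_b]
urban_basketball = fraction_willing(db, [0, 0, 1])
rural_top_ten = fraction_willing(db, [1, 1, 0])
```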

Theorem 1.3 (Releasing Decision Lists). For any $k, m \in \mathbb{N}$ s.t. $k \leq m$, any $\alpha \in (0,1]$, and any
$\varepsilon \geq 1/n$, there is an $\varepsilon$-differentially private sanitizer with running time $m^{O(\sqrt{k}\log(1/\alpha))}$ that, on
input a database $D \in (\mathrm{DL}_{k,m})^n$, releases a summary that enables computing any length-$k$ decision
list query up to an additive error of at most $\alpha$ on every query, provided that $|D| \geq m^{O(\sqrt{k}\log(1/\alpha))}/\varepsilon$.

For comparison, we note that all the results on releasing $k$-way disjunctions (including ours) also
apply to a dual setting where the database records specify a $k$-way disjunction over $m$ bits and the
queries are $m$-bit strings (in this setting $m$ plays the role of $d$). Theorem 1.3 generalizes this dual
version of Theorem 1.1, as length-$k$ decision lists are a strict generalization of $k$-way disjunctions.

We prove the latter two results (Theorems 1.2 and 1.3) using the same approach outlined for
marginals (Theorem 1.1), but with different low-degree polynomial approximations appropriate for
the different types of queries.



On Synthetic Data. An attractive type of summary is a synthetic database. A synthetic
database is a new database $\hat{D} \in (\{0,1\}^d)^{\hat{n}}$ whose rows are "fake", but such that $\hat{D}$ approximately
preserves many of the statistical properties of the database $D$ (e.g. all the marginals). Some of the
previous work on counting query release has provided synthetic data, starting with Barak et al. [1]
and including $ Q3] . 

Unfortunately, Ullman and Vadhan [23] (building on [7]) have shown that no differentially
private sanitizer with running time $d^{O(1)}$ can take a database $D \in (\{0,1\}^d)^n$ and output a private
synthetic database $\hat{D}$, all of whose 2-way marginals are approximately equal to those of $D$ (assuming
the existence of one-way functions). They also showed that there is a constant $k \in \mathbb{N}$ such that no
differentially private sanitizer with running time $2^{d^{o(1)}}$ can output a private synthetic database,
all of whose $k$-way marginals are approximately equal to those of $D$ (under stronger cryptographic
assumptions).
Table 1: Summary of prior results on differentially private release of $k$-way marginals. The database
size column indicates the minimum database size required to release answers to $k$-way marginals up
to an additive error of $\alpha$. For clarity, we ignore the dependence on the privacy parameters and the
failure probability of the algorithms. Notice that this paper contains the first algorithm capable of
releasing $k$-way marginals with running time and worst-case error substantially smaller than the
number of queries.

a Worst case error indicates that the accuracy guarantee holds for every marginal. The other types of error
indicate that accuracy holds for random marginals over a given distribution from a particular class of distributions
(e.g. product distributions).

b The results of [4] apply only to the uniform distribution over all marginals.

When $k = d$, our sanitizer runs in time $2^{\tilde{O}(\sqrt{d})}$ and releases a private summary that enables an
analyst to approximately answer any marginal query on $D$. Prior to our work it was not known
how to release any summary enabling approximate answers to all marginals in time $2^{d^{o(1)}}$. Thus,
our results show that releasing a private summary for all marginal queries can be done considerably
more efficiently if we do not require the summary to be a synthetic database (under the hardness
assumptions made in [23]).

2 Preliminaries 

2.1 Differentially Private Sanitizers 

Let a database $D \in \mathcal{X}^n$ be a collection of $n$ rows $(x^{(1)}, \dots, x^{(n)})$ from a data universe $\mathcal{X}$. We say that
two databases $D_1, D_2 \in \mathcal{X}^n$ are adjacent if they differ only on a single row, and we denote this by
$D_1 \sim D_2$.

A sanitizer $A : \mathcal{X}^n \to \mathcal{R}$ takes a database as input and outputs some data structure in $\mathcal{R}$. We
are interested in sanitizers that satisfy differential privacy.

Definition 2.1 (Differential Privacy [6]). A sanitizer $A : \mathcal{X}^n \to \mathcal{R}$ is $(\varepsilon, \delta)$-differentially private
if for every two adjacent databases $D, D' \in \mathcal{X}^n$ and every subset $S \subseteq \mathcal{R}$, $\Pr[A(D) \in S] \leq
e^{\varepsilon} \Pr[A(D') \in S] + \delta$. In the case where $\delta = 0$ we say that $A$ is $\varepsilon$-differentially private.

Since a sanitizer that always outputs $\perp$ satisfies Definition 2.1, we also need to define what it
means for a sanitizer to be accurate. In particular, we are interested in sanitizers that give accurate
answers to counting queries. A counting query is defined by a boolean predicate $q : \mathcal{X} \to \{0,1\}$.
We define the evaluation of the query $q$ on a database $D \in \mathcal{X}^n$ to be $q(D) = \frac{1}{n}\sum_{i=1}^{n} q(x^{(i)})$. We
use $Q$ to denote a set of counting queries.

Since $A$ may output an arbitrary data structure, we must specify how to answer queries in
$Q$ from the output $A(D)$. Hence, we require that there is an evaluator $\mathcal{E} : \mathcal{R} \times Q \to \mathbb{R}$ that
estimates $q(D)$ from the output of $A(D)$. For example, if $A$ outputs a vector of "noisy answers"
$Z = (q(D) + Z_q)_{q \in Q}$, where $Z_q$ is a random variable for each $q \in Q$, then $\mathcal{R} = \mathbb{R}^{Q}$ and $\mathcal{E}(Z, q)$ is
the $q$-th component of $Z$. Abusing notation, we write $q(Z)$ and $q(A(D))$ as shorthand for $\mathcal{E}(Z, q)$
and $\mathcal{E}(A(D), q)$, respectively. Since we are interested in the efficiency of the sanitization process as
a whole, when we refer to the running time of $A$, we also include the running time of the evaluator
$\mathcal{E}$. We say that $A$ is "accurate" for the query set $Q$ if the values $q(A(D))$ are close to the answers
$q(D)$. Formally,

Definition 2.2 (Accuracy). An output $Z$ of a sanitizer $A(D)$ is $\alpha$-accurate for the query set $Q$ if
$|q(Z) - q(D)| \leq \alpha$ for every $q \in Q$. A sanitizer is $(\alpha, \beta)$-accurate for the query set $Q$ if for every
database $D$,

$$\Pr\left[\forall q \in Q,\ |q(A(D)) - q(D)| \leq \alpha\right] \geq 1 - \beta,$$

where the probability is taken over the coins of $A$.

We will make use of the Laplace mechanism. Let $\mathrm{Lap}^k(\sigma)$ denote a draw from the random
variable over $\mathbb{R}^k$ in which each coordinate is chosen independently according to the density function
$\mathrm{Lap}_\sigma(x) \propto e^{-|x|/\sigma}$. Let $D \in \mathcal{X}^n$ be a database and $g : \mathcal{X}^n \to \mathbb{R}^k$ be a function such that for every
pair of adjacent databases $D \sim D'$, $\|g(D) - g(D')\|_\infty \leq \Delta$. Then we have the following two lemmas:

Lemma 2.3 (Laplace Mechanism, $\varepsilon$-Differential Privacy [6]). For $D, g, k, \Delta$ as above, the mechanism
$A(D) = g(D) + \mathrm{Lap}^k(\Delta k/\varepsilon)$ satisfies $\varepsilon$-differential privacy. Furthermore, for any $\beta > 0$,
$\Pr_A[\|g(D) - A(D)\|_1 \leq \alpha] \geq 1 - \beta$, for $\alpha = 2\Delta k^2 \log(k/\beta)/\varepsilon$.
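The mechanism of Lemma 2.3 is short enough to sketch directly. The code below is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the Laplace noise is sampled as the difference of two exponential draws from Python's standard `random` module, and the example answers are made up.

```python
import math
import random
from typing import List, Sequence


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values: Sequence[float], sensitivity: float,
                      epsilon: float, rng: random.Random) -> List[float]:
    """Add Lap(sensitivity * k / epsilon) noise to each of the k coordinates."""
    k = len(values)
    scale = sensitivity * k / epsilon
    return [v + laplace_sample(scale, rng) for v in values]


def l1_error_bound(sensitivity: float, k: int, epsilon: float, beta: float) -> float:
    """The accuracy parameter alpha = 2 * Delta * k^2 * log(k / beta) / epsilon of the lemma."""
    return 2.0 * sensitivity * k ** 2 * math.log(k / beta) / epsilon


# Three counting-query answers on a database with n = 1000 rows (made-up numbers).
# Changing one row moves each answer by at most 1/n, so Delta = 1/n.
n = 1000
true_answers = [0.5, 0.25, 0.75]
rng = random.Random(0)
noisy = laplace_mechanism(true_answers, sensitivity=1.0 / n, epsilon=1.0, rng=rng)
bound = l1_error_bound(1.0 / n, k=3, epsilon=1.0, beta=0.05)
l1_error = sum(abs(a - b) for a, b in zip(true_answers, noisy))
```

The bound comes from a union bound over the coordinates, so in practice the observed $L_1$ error is typically well below it.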

The choice of the $L_1$ norm in the accuracy guarantee of the lemma is for convenience, and
doesn't matter for the parameters of Theorems 1.1 to 1.3 (except for the hidden constants).

If the privacy requirement is relaxed to $(\varepsilon, \delta)$-differential privacy (for $\delta > 0$), then it is sufficient
to perturb each coordinate of $g(D)$ with noise from a Laplace distribution of smaller magnitude,
leading to smaller error.

Lemma 2.4 (Laplace Mechanism, $(\varepsilon, \delta)$-Differential Privacy [5], [2]). For $D, g, k, \Delta$ as above,
and for every $\delta > 0$, the mechanism $A(D) = g(D) + \mathrm{Lap}^k(3\Delta\sqrt{k\log(1/\delta)}/\varepsilon)$ satisfies $(\varepsilon, \delta)$-differential
privacy. Furthermore, for any $\beta > 0$, $\Pr_A[\|g(D) - A(D)\|_1 \leq \alpha] \geq 1 - \beta$, for
$\alpha = 6\Delta k\sqrt{k\log(1/\delta)}\log(k/\beta)/\varepsilon$.

2.2 Query Function Families 

We take the approach of Gupta et al. [11] and think of the database $D$ as specifying a function
$f_D$ mapping queries $q$ to their answers $q(D)$, which we call the $Q$-representation of $D$. We now
describe this transformation more formally:

Definition 2.5 ($Q$-Function Family). Let $Q = \{q_y\}_{y \in Y_Q \subseteq \{0,1\}^m}$ be a set of counting queries on a
data universe $\mathcal{X}$, where each query is indexed by an $m$-bit string. We define the index set of $Q$ to
be the set $Y_Q = \{y \in \{0,1\}^m \mid q_y \in Q\}$.
We define the $Q$-function family $\mathcal{F}_Q = \{f_{Q,x} : \{0,1\}^m \to \{0,1\}\}_{x \in \mathcal{X}}$ as follows: For every possible
database row $x \in \mathcal{X}$, the function $f_{Q,x} : \{0,1\}^m \to \{0,1\}$ is defined as $f_{Q,x}(y) = q_y(x)$. Given a
database $D \in \mathcal{X}^n$ we define the function $f_{Q,D} : \{0,1\}^m \to [0,1]$ where $f_{Q,D}(y) = \frac{1}{n}\sum_{i=1}^{n} f_{Q,x^{(i)}}(y)$.
When $Q$ is clear from context we will drop the subscript $Q$ and simply write $f_x$, $f_D$, and $\mathcal{F}$.

For some intuition about this transformation, when the queries are monotone $k$-way disjunctions
on a database $D \in (\{0,1\}^d)^n$, the queries are defined by sets $S \subseteq [d]$, $|S| \leq k$. In this case each
query can be represented by the $d$-bit indicator vector of the set $S$, with at most $k$ non-zero entries.
Thus we can take $m = d$ and $Y_Q = \{y \in \{0,1\}^d \mid \sum_{j=1}^{d} y_j \leq k\}$.
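The change of viewpoint can be illustrated with a short sketch that tabulates $f_D$ over the index set $Y_Q$ for monotone disjunctions. The helper names and the toy database are assumptions made for this example only.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Sequence, Tuple


def index_set(d: int, k: int) -> Iterator[Tuple[int, ...]]:
    """Yield every indicator vector y in {0,1}^d with at most k ones."""
    for size in range(k + 1):
        for subset in combinations(range(d), size):
            yield tuple(1 if j in subset else 0 for j in range(d))


def f_row(x: Sequence[int], y: Sequence[int]) -> int:
    """f_x(y) = q_y(x): the disjunction of the bits of x selected by y."""
    return int(any(xj and yj for xj, yj in zip(x, y)))


def f_db(db: Sequence[Sequence[int]], y: Sequence[int]) -> float:
    """f_D(y): the average of f_x(y) over the rows x of D."""
    return sum(f_row(x, y) for x in db) / len(db)


# Toy database with n = 3 rows and d = 3 attributes (made-up data).
toy_db = [(1, 0, 0), (0, 1, 0), (0, 0, 0)]

# The Q-representation of the database for monotone 2-way disjunctions,
# stored as a table from query index y to the answer f_D(y).
table = {y: f_db(toy_db, y) for y in index_set(3, 2)}
```

Here the database plays the role of a function and each query is just an input to it; the table has one entry per vector with at most $k$ ones.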

2.3 Polynomial Approximations 

An $m$-variate real polynomial $p \in \mathbb{R}[y_1, \dots, y_m]$ of degree $t$ and ($L_\infty$) norm $T$ can be written as

$$p(y) = \sum_{\substack{j_1, \dots, j_m \geq 0 \\ j_1 + \dots + j_m \leq t}} c_{j_1, \dots, j_m} \prod_{\ell=1}^{m} y_\ell^{j_\ell},$$

where $|c_{j_1, \dots, j_m}| \leq T$ for every $j_1, \dots, j_m$. Recall that there
are at most $\binom{m+t}{t}$ coefficients in an $m$-variate polynomial of total degree $t$. Often we will want
to associate a polynomial $p$ of degree $t$ and norm $T$ with its coefficient vector $\vec{p} \in [-T, T]^{\binom{m+t}{t}}$.
Specifically, $\vec{p} = (c_{j_1, \dots, j_m})_{j_1, \dots, j_m \geq 0,\ j_1 + \dots + j_m \leq t}$. Given a vector $\vec{p}$ and a point $y \in \{0,1\}^m$ we use $\vec{p}(y)$ to
indicate the evaluation of the polynomial described by the vector $\vec{p}$ at the point $y$. Observe this
is equivalent to computing $\vec{p} \cdot \vec{y}$ where $\vec{y} \in \{0,1\}^{\binom{m+t}{t}}$ is defined as $\vec{y}_{j_1, \dots, j_m} = \prod_{\ell=1}^{m} y_\ell^{j_\ell}$ for every
$j_1, \dots, j_m \geq 0$, $j_1 + \dots + j_m \leq t$.
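The correspondence between a polynomial, its coefficient vector, and the expanded monomial vector can be checked on a tiny example. This is an illustrative sketch: the ordering of the exponent tuples and the sample polynomial are arbitrary choices, not taken from the paper.

```python
from itertools import product
from math import comb, prod
from typing import Dict, List, Sequence, Tuple


def exponent_tuples(m: int, t: int) -> List[Tuple[int, ...]]:
    """All (j_1, ..., j_m) with non-negative entries and j_1 + ... + j_m <= t."""
    return [e for e in product(range(t + 1), repeat=m) if sum(e) <= t]


def coefficient_vector(poly: Dict[Tuple[int, ...], float], m: int, t: int) -> List[float]:
    """Lay out the coefficients of `poly` in the fixed order of exponent_tuples."""
    return [poly.get(e, 0.0) for e in exponent_tuples(m, t)]


def monomial_vector(y: Sequence[int], t: int) -> List[int]:
    """Expand the point y into the vector of all monomials of degree at most t."""
    return [prod(yl ** j for yl, j in zip(y, e)) for e in exponent_tuples(len(y), t)]


def evaluate(coeffs: Sequence[float], y: Sequence[int], t: int) -> float:
    """Evaluate the polynomial as an inner product of the two vectors."""
    return sum(c * v for c, v in zip(coeffs, monomial_vector(y, t)))


# Example with m = 2 variables and degree t = 2: p(y) = 1 + 2*y_1 - 3*y_1*y_2.
p = {(0, 0): 1.0, (1, 0): 2.0, (1, 1): -3.0}
vec = coefficient_vector(p, m=2, t=2)
value_at_11 = evaluate(vec, (1, 1), t=2)
value_at_10 = evaluate(vec, (1, 0), t=2)
```

On a point of $\{0,1\}^m$ every monomial evaluates to 0 or 1, which is the property the accuracy analysis relies on later.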

Let $\mathcal{P}_{t,T}$ be the family of all $m$-variate real polynomials of degree $t$ and norm $T$. In many cases,
the functions $f_{Q,x} : \{0,1\}^m \to \{0,1\}$ can be approximated well on all the indices in $Y_Q$ by a family
of polynomials $\mathcal{P}_{t,T}$ with low degree and small norm. Formally:

Definition 2.6 (Uniform Approximation by Polynomials). Given a family of $m$-variate functions
$\mathcal{F} = \{f_x\}_{x \in \mathcal{X}}$ and a set $Y \subseteq \{0,1\}^m$, we say that the family $\mathcal{P}_{t,T}$ uniformly $\gamma$-approximates $\mathcal{F}$ on
$Y$ if for every $x \in \mathcal{X}$, there exists $p_x \in \mathcal{P}_{t,T}$ such that $\max_{y \in Y} |f_x(y) - p_x(y)| \leq \gamma$.

We say that $\mathcal{P}_{t,T}$ efficiently and uniformly $\gamma$-approximates $\mathcal{F}$ if there is an algorithm $\mathcal{P}_{\mathcal{F}}$ that
takes $x \in \mathcal{X}$ as input, runs in time $\mathrm{poly}(\log|\mathcal{X}|, \binom{m+t}{t}, \log T)$, and outputs a coefficient vector $\vec{p}_x$
such that $\max_{y \in Y} |f_x(y) - p_x(y)| \leq \gamma$.

3 From Polynomial Approximations to Data Release Algorithms 

In this section we present an algorithm for privately releasing any family of counting queries $Q$
such that $\mathcal{F}_Q$ can be efficiently and uniformly approximated by polynomials. The algorithm
takes an $n$-row database $D$ and, for each row $x \in D$, constructs a polynomial $p_x$ that uniformly
approximates the function $f_{Q,x}$ (recall that $f_{Q,x}(q) = q(x)$, for each $q \in Q$). From these, it
constructs a polynomial $p_D = \frac{1}{n}\sum_{x \in D} p_x$ that uniformly approximates $f_{Q,D}$. The final step is to
perturb each of the coefficients of $p_D$ using noise from a Laplace distribution (Lemma 2.3) and
bound the error introduced from the perturbation.

Theorem 3.1 (Releasing Polynomials). Let $Q = \{q_y\}_{y \in Y_Q \subseteq \{0,1\}^m}$ be a set of counting queries
over $\{0,1\}^d$, and $\mathcal{F}_Q$ be the $Q$-function family (Definition 2.5). Assume that $\mathcal{P}_{t,T}$ efficiently and
uniformly $\gamma$-approximates $\mathcal{F}_Q$ on $Y_Q$ (Definition 2.6). Then there is a sanitizer $A : (\{0,1\}^d)^n \to
\mathbb{R}^{\binom{m+t}{t}}$ that

1. is $\varepsilon$-differentially private,

2. runs in time $\mathrm{poly}(n, d, \binom{m+t}{t}, \log T, \log(1/\varepsilon))$, and

3. is $(\alpha, \beta)$-accurate for $Q$ for $\alpha = \gamma + \frac{4T\binom{m+t}{t}^2 \log\left(\binom{m+t}{t}/\beta\right)}{\varepsilon n}$.

Proof. First we construct the sanitizer $A$. See the relevant codebox below.

The Sanitizer $A$

Input: A database $D \in (\{0,1\}^d)^n$, an explicit family of polynomials $\mathcal{P}$, and a parameter $\varepsilon > 0$.
For $i = 1, \dots, n$:
    Using the efficient approximation of $\mathcal{F}$ by $\mathcal{P}$, compute a polynomial $\vec{p}_{x^{(i)}} = \mathcal{P}_{\mathcal{F}}(x^{(i)})$ that
    $\gamma$-approximates $f_{x^{(i)}}$ on $Y_Q$.
Let $\vec{p}_D = \frac{1}{n}\sum_{i=1}^{n} \vec{p}_{x^{(i)}}$, where the sum denotes standard entry-wise vector addition.
Let $\tilde{p}_D = \vec{p}_D + Z$, where $Z$ is drawn from a $\binom{m+t}{t}$-variate Laplace distribution with parameter
$2T\binom{m+t}{t}/\varepsilon n$ (Section 2.1).
Output: $\tilde{p}_D$.



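Privacy. We establish that $A$ is $\varepsilon$-differentially private. This follows from the observation that
for any two adjacent databases $D \sim D'$ that differ only on row $i^*$,

$$\left\|\vec{p}_D - \vec{p}_{D'}\right\|_\infty = \left\|\frac{1}{n}\sum_{i=1}^{n} \vec{p}_{x^{(i)}} - \frac{1}{n}\sum_{i=1}^{n} \vec{p}_{x'^{(i)}}\right\|_\infty = \frac{1}{n}\left\|\vec{p}_{x^{(i^*)}} - \vec{p}_{x'^{(i^*)}}\right\|_\infty \leq \frac{2T}{n}.$$

The last inequality is from the fact that for every $x$, $\vec{p}_x$ is a vector of $L_\infty$ norm at most $T$. Part 1
of the Theorem now follows directly from the properties of the Laplace mechanism (Lemma 2.3).
Now we construct the evaluator $\mathcal{E}$.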



The Evaluator $\mathcal{E}$ for the Sanitizer $A$

Input: A vector $\vec{p} \in \mathbb{R}^{\binom{m+t}{t}}$ and the description of a query $y \in \{0,1\}^m$.

Output: $\vec{p}(y)$. Recall that we view $\vec{p}$ as an $m$-variate polynomial, $p$, and $\vec{p}(y)$ is the evaluation
of $p$ on the point $y$.
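The sanitizer and evaluator described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the per-row polynomial family is a hypothetical toy (degree $t = 1$, norm $T = 1$, exact on queries $y$ with at most one non-zero entry), the Laplace scale is taken as $\Delta k/\varepsilon$ with $\Delta = 2T/n$ in line with the Laplace mechanism of Section 2.1, and all function names and data are made up.

```python
import random
from typing import Callable, List, Sequence


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sanitize(db: Sequence[Sequence[int]],
             row_poly: Callable[[Sequence[int]], List[float]],
             norm_T: float, epsilon: float, rng: random.Random) -> List[float]:
    """Average the per-row coefficient vectors, then perturb every coefficient."""
    n = len(db)
    vectors = [row_poly(x) for x in db]
    k = len(vectors[0])  # number of coefficients, binom(m + t, t) in the text
    p_D = [sum(v[i] for v in vectors) / n for i in range(k)]
    scale = 2.0 * norm_T * k / (epsilon * n)  # Delta * k / epsilon with Delta = 2T / n
    return [c + laplace_sample(scale, rng) for c in p_D]


def one_way_row_poly(x: Sequence[int]) -> List[float]:
    """Toy family (an assumption): p_x(y) = sum_j x_j * y_j, stored as [constant, c_1, ..., c_d]."""
    return [0.0] + [float(b) for b in x]


def evaluate(summary: Sequence[float], y: Sequence[int]) -> float:
    """Evaluator for degree-1 summaries: constant term plus inner product with y."""
    return summary[0] + sum(c * yj for c, yj in zip(summary[1:], y))


# Made-up database: four distinct rows repeated so that n = 20000.
rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
db = rows * 5000
summary = sanitize(db, one_way_row_poly, norm_T=1.0, epsilon=1.0, rng=random.Random(7))

# Query y = (1, 0, 0): fraction of rows with the first attribute set; the true answer is 0.75.
estimate = evaluate(summary, (1, 0, 0))
```

Only the released coefficient vector is needed to answer queries, so the evaluator never touches the database.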



Efficiency. Next, we show that $A$ runs in time $\mathrm{poly}(n, d, \binom{m+t}{t}, \log T, \log(1/\varepsilon))$. Recall that
we assumed the polynomial construction algorithm $\mathcal{P}_{\mathcal{F}}$ runs in time $\mathrm{poly}(d, \binom{m+t}{t}, \log T)$. The
algorithm $A$ needs to run $\mathcal{P}_{\mathcal{F}}$ on each of the $n$ rows, and then it needs to generate $\binom{m+t}{t}$ samples
from a univariate Laplace distribution with magnitude $\mathrm{poly}(T, \binom{m+t}{t}, 1/n, 1/\varepsilon)$, which can
also be done in time $\mathrm{poly}(\binom{m+t}{t}, \log T, \log n, \log(1/\varepsilon))$. We also establish that $\mathcal{E}$ runs in time
$\mathrm{poly}(\binom{m+t}{t}, \log T, \log n, \log(1/\varepsilon))$: observe that $\mathcal{E}$ needs to expand the input into an appropriate
vector of dimension $\binom{m+t}{t}$ and take the inner product with the vector $\vec{p}$, whose entries have
magnitude $\mathrm{poly}(\binom{m+t}{t}, T, 1/n, 1/\varepsilon)$. These observations establish Part 2 of the Theorem.
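Accuracy. Finally, we analyze the accuracy of the sanitizer $A$. First, by the assumption that $\mathcal{P}_{t,T}$
uniformly $\gamma$-approximates $\mathcal{F}$ on $Y \subseteq \{0,1\}^m$, we have

$$\max_{y \in Y} |f_D(y) - p_D(y)| = \max_{y \in Y} \left|\frac{1}{n}\sum_{i=1}^{n} f_{x^{(i)}}(y) - \frac{1}{n}\sum_{i=1}^{n} p_{x^{(i)}}(y)\right| \leq \frac{1}{n}\sum_{i=1}^{n} \max_{y \in Y} \left|f_{x^{(i)}}(y) - p_{x^{(i)}}(y)\right| \leq \gamma.$$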



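Now we want to establish that

$$\Pr\left[\max_{y \in \{0,1\}^m} |\tilde{p}_D(y) - p_D(y)| \leq \alpha'\right] \geq 1 - \beta \quad \text{for} \quad \alpha' = \frac{4T\binom{m+t}{t}^2 \log\left(\binom{m+t}{t}/\beta\right)}{\varepsilon n},$$

where the probability is taken over the coins of $A$. Part (3) of the Theorem will then follow by the
triangle inequality.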

To see that the above statement is true, observe that by the properties of the Laplace mechanism
(Lemma 2.3), we have $\Pr\left[\|\tilde{p}_D - \vec{p}_D\|_1 \leq \alpha'\right] \geq 1 - \beta$, where the probability is taken over the coins
of $A$. Given that $\|\tilde{p}_D - \vec{p}_D\|_1 \leq \alpha'$, it holds that for every $y \in \{0,1\}^m$,

$$|\tilde{p}_D(y) - p_D(y)| = |(\tilde{p}_D - \vec{p}_D)(y)| \leq \|\tilde{p}_D - \vec{p}_D\|_1 \leq \alpha'.$$

The first inequality follows from the fact that every monomial evaluates to 0 or 1 at the point $y$.
This completes the proof of the theorem.

□

Using Lemma 2.4 we can improve the bound on the error at the expense of relaxing the privacy
guarantee to $(\varepsilon, \delta)$-differential privacy. This improved error only affects the hidden constants in
Theorems 1.1 to 1.3, so we only state those theorems for $\varepsilon$-differential privacy.

Theorem 3.2 (Releasing Polynomials, $(\varepsilon, \delta)$-Differential Privacy). Let $Q = \{q_y\}_{y \in Y_Q \subseteq \{0,1\}^m}$ be a
set of counting queries over $\{0,1\}^d$, and $\mathcal{F}_Q$ be the $Q$-function family (Definition 2.5). Assume that
$\mathcal{P}_{t,T}$ efficiently and uniformly $\gamma$-approximates $\mathcal{F}_Q$ on $Y_Q$ (Definition 2.6). Then there is a sanitizer
$A : (\{0,1\}^d)^n \to \mathbb{R}^{\binom{m+t}{t}}$ that

1. is $(\varepsilon, \delta)$-differentially private,

2. runs in time $\mathrm{poly}(n, d, \binom{m+t}{t}, \log T, \log(1/\varepsilon), \log(1/\delta))$,

3. is $(\alpha, \beta)$-accurate for $Q$ for $\alpha = \gamma + \frac{12T\binom{m+t}{t}\sqrt{\binom{m+t}{t}\log(1/\delta)}\,\log\left(\binom{m+t}{t}/\beta\right)}{\varepsilon n}$.



The proof of this theorem is identical to that of Theorem 3.1 but using the analysis of the
Laplace mechanism from Lemma 2.4 in place of that of Lemma 2.3.



4 Applications 

In this section we establish the existence of explicit families of low-degree polynomials approximating
the families $\mathcal{F}_Q$ for some interesting query sets.
4.1 Releasing Monotone Disjunctions

We define the class of monotone $k$-way disjunctions as follows:

Definition 4.1 (Monotone $k$-Way Disjunctions). Let $\mathcal{X} = \{0,1\}^d$. The query set $Q_{\mathrm{Disj},k} =
\{q_y\}_{y \in Y_k \subseteq \{0,1\}^d}$ of monotone $k$-way disjunctions over $\{0,1\}^d$ contains a query $q_y$ for every $y \in
Y_k = \{y \in \{0,1\}^d \mid |y| \leq k\}$. Each query is defined as $q_y(x_1, \dots, x_d) = \bigvee_{j=1}^{d} y_j x_j$. The $Q_{\mathrm{Disj},k}$
function family $\mathcal{F}_{Q_{\mathrm{Disj},k}} = \{f_x\}_{x \in \{0,1\}^d}$ contains a function $f_x(y_1, \dots, y_d) = \bigvee_{j=1}^{d} y_j x_j$ for every
$x \in \{0,1\}^d$.

Thus the family $\mathcal{F}_{Q_{\mathrm{Disj},k}}$ consists of all disjunctions, and the image of $Q_{\mathrm{Disj},k}$, which we denote $Y_k$,
consists of all vectors $y \in \{0,1\}^d$ with at most $k$ non-zero entries. We can approximate disjunctions
over the set $Y_k$ using a well-known transformation of the Chebyshev polynomials (see, e.g., [17,
Theorem 8] and [15, Claim 5.4]). First we recall the useful properties of the univariate Chebyshev
polynomials.

Fact 4.2 (Chebyshev Polynomials). For every $k \in \mathbb{N}$ and $\gamma > 0$, there exists a univariate real
polynomial $g_k(x) = \sum_{i=0}^{t_k} c_i x^i$ of degree $t_k$ such that

1. $t_k = O(\sqrt{k}\log(1/\gamma))$,

2. for every $i \in \{0, 1, \dots, t_k\}$, $|c_i| \leq 2^{O(\sqrt{k}\log(1/\gamma))}$,

3. $g_k(0) = 0$, and

4. for every $x \in \{1, \dots, k\}$, $1 - \gamma \leq g_k(x) \leq 1 + \gamma$.

Moreover, such a polynomial can be constructed in time $\mathrm{poly}(k, \log(1/\gamma))$ (e.g. using linear
programming, though more efficient algorithms are known).
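Properties 3 and 4 can be checked numerically with one standard Chebyshev-based construction, $g_k(x) = 1 - T_t\!\left(\frac{k+1-2x}{k-1}\right) / T_t\!\left(\frac{k+1}{k-1}\right)$, where $T_t$ is the degree-$t$ Chebyshev polynomial of the first kind. This particular formula is an assumption on our part (the text only cites the fact), it requires $k \geq 2$, and the sketch evaluates $g_k$ pointwise rather than extracting the coefficients $c_i$. The degree is found by a simple search instead of a closed form.

```python
from typing import Callable, Tuple


def chebyshev_T(t: int, z: float) -> float:
    """Evaluate the degree-t Chebyshev polynomial of the first kind at z
    using the recurrence T_{j+1}(z) = 2 z T_j(z) - T_{j-1}(z)."""
    previous, current = 1.0, z
    if t == 0:
        return previous
    for _ in range(t - 1):
        previous, current = current, 2.0 * z * current - previous
    return current


def make_g(k: int, gamma: float) -> Tuple[Callable[[float], float], int]:
    """Return (g_k, degree) with g_k(0) = 0 and |g_k(x) - 1| <= gamma on [1, k]."""
    if k < 2 or not 0.0 < gamma < 1.0:
        raise ValueError("this sketch assumes k >= 2 and 0 < gamma < 1")
    outside = (k + 1) / (k - 1)  # the point that x = 0 is mapped to; it lies above 1
    degree = 1
    # T_t grows without bound outside [-1, 1], so this search terminates.
    while 1.0 / chebyshev_T(degree, outside) > gamma:
        degree += 1
    normalizer = chebyshev_T(degree, outside)

    def g(x: float) -> float:
        # The affine map sends [1, k] onto [-1, 1], where |T_t| <= 1.
        return 1.0 - chebyshev_T(degree, (k + 1 - 2 * x) / (k - 1)) / normalizer

    return g, degree


k, gamma = 16, 0.01
g, degree = make_g(k, gamma)
worst_gap = max(abs(g(x) - 1.0) for x in range(1, k + 1))
```

For $k = 16$ and $\gamma = 0.01$ the search stops at a degree well below $k$, in line with the $\sqrt{k}\log(1/\gamma)$ scaling stated in the fact.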

We can use Fact 4.2 to approximate $k$-way monotone disjunctions. Note that our result
easily extends to monotone $k$-way conjunctions via the identity
$\bigwedge_{j=1}^{d} x_j y_j = 1 - \bigvee_{j=1}^{d} (1 - x_j) y_j$. Moreover, it extends to non-monotone conjunctions and disjunctions:
we may extend the data universe as in [15, Theorem 1.2] to $\{0,1\}^{2d}$, and include the negation of
each item in the original domain. Non-monotone conjunctions over domain $\{0,1\}^d$ correspond to
monotone conjunctions over the expanded domain $\{0,1\}^{2d}$.

The next lemma shows that $\mathcal{F}_{Q_{\mathrm{Disj},k}}$ can be efficiently and uniformly approximated by polynomials
of low degree and low norm. The statement is a well-known application of Chebyshev
polynomials, and a similar statement appears in [15] but without bounding the running time of the
construction or a bound on the norm of the polynomials. We include the statement and a proof
for completeness, and to verify the additional properties we need.

Lemma 4.3 (Approximating $\mathcal{F}_{Q_{\mathrm{Disj},k}}$ by polynomials, similar to [15]). For every $k, d \in \mathbb{N}$ such that
$k \leq d$ and every $\gamma > 0$, the family $\mathcal{P}_{t,T}$ of $d$-variate real polynomials of degree $t = O(\sqrt{k}\log(1/\gamma))$
and norm $T = d^{O(\sqrt{k}\log(1/\gamma))}$ efficiently and uniformly $\gamma$-approximates the family $\mathcal{F}_{Q_{\mathrm{Disj},k}}$ on the set
$Y_k$.
Input: a vector $x \in \{0,1\}^d$.

Let: $g_k$ be the polynomial described in Fact 4.2.

Let: $\vec{p}_x \in \mathbb{R}^{\binom{d+t_k}{t_k}}$ be the expansion of $p_x(y_1, \dots, y_d) = g_k\left(\sum_{j=1}^{d} y_j x_j\right)$.
Output: $\vec{p}_x$.



Proof. The algorithm $\mathcal{P}_{\mathrm{Disj},k}$ for constructing the polynomials appears in the relevant codebox
above.

Since $p_x$ is a degree-$t_k$ polynomial applied to a degree-1 polynomial (in the variables $y_j$), its
degree is at most $t_k$. To see the stated norm bound, note that every monomial of total degree $i$ in $p_x$
comes from the expansion of $\left(\sum_{j=1}^{d} y_j x_j\right)^i$, and every coefficient in this expansion is a non-negative
integer less than or equal to $k^i$. In $p_x$, each of these terms is multiplied by $c_i$ (the $i$-th coefficient
of $g_k$). Thus the norm of $p_x$ is at most $\max_{i \in \{0, 1, \dots, t_k\}} k^i \cdot |c_i| = k^{O(\sqrt{k}\log(1/\gamma))} = d^{O(\sqrt{k}\log(1/\gamma))}$.
To see that $\mathcal{P}_{\mathrm{Disj},k}$ is efficient, note that we can find every coefficient of $p_x$ of total degree $i$ by
expanding $\left(\sum_{j=1}^{d} y_j x_j\right)^i$ into all of its $d^i$ terms and multiplying by $c_i$, which can be done in time
$\mathrm{poly}(d^{t_k}) = \mathrm{poly}\left(\binom{d+t_k}{t_k}\right)$, as is required.

To see that $\mathcal{P}_{\mathrm{Disj},k}$ $\gamma$-approximates $\mathcal{F}_{Q_{\mathrm{Disj},k}}$, observe that for every $x, y \in \{0,1\}^d$, $f_x(y) = 0 \Rightarrow
p_x(y) = 0$, and for every $x \in \{0,1\}^d$, $y \in Y_k$, $f_x(y) = 1 \Rightarrow 1 - \gamma \leq p_x(y) \leq 1 + \gamma$. This completes
the proof. □

Theorem 1.1 in the introduction follows by combining Theorem 3.1 and Lemma 4.3.
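A quick count illustrates why the released summary can be much smaller than the query set. The sketch below compares the number of coefficients $\binom{d+t}{t}$ of a degree-$t$ polynomial in $d$ variables with the number of monotone disjunctions on at most $k$ attributes. The parameter values, in particular the degree $t = 11$ for $k = 16$, are illustrative assumptions; the asymptotic statements in the text hide constants that this arithmetic does not capture.

```python
from math import comb


def num_disjunctions(d: int, k: int) -> int:
    """Number of index vectors y in {0,1}^d with at most k ones."""
    return sum(comb(d, i) for i in range(k + 1))


def num_coefficients(d: int, t: int) -> int:
    """Upper bound on the number of coefficients of a d-variate polynomial of degree t."""
    return comb(d + t, t)


# Hypothetical parameters: d = 1000 attributes, k = 16, and a degree t on the
# order of sqrt(k) * log(1 / gamma) for a constant gamma.
d, k, t = 1000, 16, 11
summary_size = num_coefficients(d, t)
query_count = num_disjunctions(d, k)
ratio = query_count // summary_size
```

With these values the summary has many orders of magnitude fewer entries than there are queries, and the gap widens as $k$ grows because the degree scales with $\sqrt{k}$ rather than $k$.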

4.2 Releasing Monotone r-of-k Queries 

We define the class of monotone r-of-k queries as follows:

Definition 4.4 (Monotone r-of-k Queries). Let $X = \{0,1\}^d$ and $r, k \in \mathbb{N}$ such that $r \le k \le d$.
The query set $Q_{r,k} = \{q_y\}_{y \in Y_k \subseteq \{0,1\}^d}$ of monotone r-of-k queries over $\{0,1\}^d$ contains a query $q_y$
for every $y \in Y_k = \{y \in \{0,1\}^d \mid |y| \le k\}$. Each query is defined as
$q_y(x_1,\ldots,x_d) = \mathbb{1}_{\sum_{j=1}^{d} y_j x_j \ge r}$.
The $Q_{r,k}$ function family $\mathcal{F}_{Q_{r,k}} = \{f_x\}_{x \in \{0,1\}^d}$ contains a function
$f_x(y_1,\ldots,y_d) = \mathbb{1}_{\sum_{j=1}^{d} y_j x_j \ge r}$ for every $x \in \{0,1\}^d$.
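
To make the definition concrete, here is a small Python sketch that evaluates a monotone r-of-k query on a single record and on a toy database. The function names and the tuple encoding of records are assumptions introduced only for this example.

```python
def r_of_k_query(y, x, r):
    """q_y(x) = 1 if at least r of the attributes selected by y are set in x."""
    return 1 if sum(y_j * x_j for y_j, x_j in zip(y, x)) >= r else 0

def answer(database, y, r):
    """Fraction of records in the database that satisfy the query q_y."""
    return sum(r_of_k_query(y, x, r) for x in database) / len(database)
```

Setting `r = 1` recovers a monotone disjunction over the attributes selected by `y`, and setting `r` equal to the number of selected attributes recovers a monotone conjunction.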

Sherstov [20, Lemma 3.11] gives an explicit construction of polynomials that can be used to
approximate the family $\mathcal{F}_{Q_{r,k}}$ over $Y_k$ with low degree. It can be verified by inspecting the construction
that the coefficients of the resulting polynomial are not too large.

Lemma 4.5 ([20]). For every $r, k \in \mathbb{N}$ such that $r \le k$ and $\gamma > 0$, there exists a univariate
polynomial $g_{r,k} : \mathbb{R} \to \mathbb{R}$ of degree $t_{r,k}$ such that $g_{r,k}(x) = \sum_{i=0}^{t_{r,k}} c_i x^i$ and

1. $t_{r,k} = O\left(\sqrt{rk}\,\log(k) + \sqrt{k\log(1/\gamma)\log(k)}\right)$,

2. for every $i \in \{0,1,\ldots,t_{r,k}\}$, $|c_i| \le 2^{\tilde{O}(\sqrt{kr\log(1/\gamma)})}$,

3. for every $x \in \{0,1,\ldots,r-1\}$, $-\gamma \le g_{r,k}(x) \le \gamma$, and

4. for every $x \in \{r,\ldots,k\}$, $1 - \gamma \le g_{r,k}(x) \le 1 + \gamma$.

Moreover, $g_{r,k}$ can be constructed in time $\mathrm{poly}(k, r, \log(1/\gamma))$ (e.g. using linear programming).

For completeness we include a proof of Lemma 4.5 in the appendix. We can use these polynomials
to approximate monotone r-of-k queries.

Lemma 4.6 (Approximating $\mathcal{F}_{Q_{r,k}}$ on $Y_k$). For every $r, k, d \in \mathbb{N}$ such that $r \le k \le d$ and every
$\gamma > 0$, the family $\mathcal{P}_{t,T}$ of $d$-variate real polynomials of degree $t = \tilde{O}(\sqrt{kr\log(1/\gamma)})$ and norm
$T = d^{\tilde{O}(\sqrt{kr\log(1/\gamma)})}$ efficiently and uniformly $\gamma$-approximates the family $\mathcal{F}_{Q_{r,k}}$ on the set $Y_k$.

Proof. The construction and proof are identical to those of Theorem 4.3, with the polynomials of
Lemma 4.5 in place of the polynomials described in Lemma 4.2. □

Theorem 1.2 in the introduction now follows by combining Theorem 3.1 and Lemma 4.6. Note that
our result also extends easily to non-monotone r-of-k queries in the same manner as Theorem 1.1.

Remark 4.7. Using the principle of inclusion-exclusion, the answer to a monotone r-of-k query
can be written as a linear combination of the answers to $k^{O(r)}$ monotone k-way disjunctions. Thus,
a sanitizer that is $(\alpha/k^{O(r)}, \beta)$-accurate for monotone k-way disjunctions implies a sanitizer that is
$(\alpha, \beta)$-accurate for monotone r-of-k queries. However, combining this implication with Theorem 1.1
yields a sanitizer with running time $d^{O(r\sqrt{k}\log(k/\alpha))}$, which has a worse dependence on $r$ than what
we achieve in Theorem 1.2.

4.3 Releasing Decision Lists 

A length-k decision list over $m$ variables is a function which can be written in the form "if $\ell_1$ then
output $b_1$ else ··· else if $\ell_k$ then output $b_k$ else output $b_{k+1}$," where each $\ell_i$ is a boolean literal
in $\{x_1, \ldots, x_m\}$, and each $b_i$ is an output bit in $\{0, 1\}$. Note that decision lists of length $k$ strictly
generalize k-way disjunctions and conjunctions. We use $\mathrm{DL}_{k,m}$ to denote the set of all length-k
decision lists over $m$ binary input variables.
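
The following Python sketch shows one way to evaluate such a list. The encoding of each rule as an `(index, negated, output)` triple is an assumption made for this example, not notation from the text.

```python
def evaluate_decision_list(rules, default, y):
    """Evaluate a decision list on the input y.

    rules:   list of (index, negated, output) triples, tried in order
    default: the final output bit b_{k+1}
    """
    for index, negated, output in rules:
        literal = 1 - y[index] if negated else y[index]
        if literal == 1:
            return output
    return default

# A 2-way monotone disjunction y_1 OR y_2 written as a length-2 decision list.
disjunction = [(0, False, 1), (1, False, 1)]
```

Under this encoding, a k-way conjunction is the list whose rules test the negated literals and output 0, with default output 1.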

Definition 4.8 (Evaluations of Length-k Decision Lists). Let $k, m \in \mathbb{N}$ such that $k \le m$ and
$X = \mathrm{DL}_{k,m}$. The query set $Q_{\mathrm{DL}_{k,m}} = \{q_y\}_{y \in \{0,1\}^m}$ of evaluations of length-k decision lists contains
a query $q_y$ for every $y \in \{0,1\}^m$. Each query is defined as $q_y(x) = x(y)$ where $x \in \mathrm{DL}_{k,m}$ is
a length-k decision list over $m$ variables. The $Q_{\mathrm{DL}_{k,m}}$ function family
$\mathcal{F}_{Q_{\mathrm{DL}_{k,m}}} = \{f_x\}_{x \in \mathrm{DL}_{k,m}}$
contains functions $f_x(y) = x(y)$ for every $x \in \mathrm{DL}_{k,m}$. That is, $\mathcal{F}_{Q_{\mathrm{DL}_{k,m}}} = \mathrm{DL}_{k,m}$.

We clarify that in this setting, the records in the database are length-k decision lists over $\{0,1\}^m$
and the queries are inputs in $\{0,1\}^m$. Thus $|X| = |\mathrm{DL}_{k,m}| = m^{O(k)}$ and $|Q| = 2^m$. Alternatively,
$X = \{0,1\}^d$ for $d = k(\log m + 2) + 1$, since a length-k decision list can be described using this
many bits. Klivans and Servedio [17, Claim 5.4] have shown that decision lists of length $k$ can be
uniformly approximated to accuracy $\gamma$ by low-degree polynomials. We give a self-contained proof
of this fact in Appendix A for completeness.

Lemma 4.9 ([17]). For every $k, m \in \mathbb{N}$ such that $k \le m$ and every $\gamma > 0$, the family $\mathcal{P}_{t,T}$ of
$m$-variate real polynomials of degree $\tilde{O}(\sqrt{k}\log(1/\gamma))$ and norm $T = m^{\tilde{O}(\sqrt{k}\log(1/\gamma))}$ efficiently and
uniformly $\gamma$-approximates the family $\mathcal{F}_{Q_{\mathrm{DL}_{k,m}}} = \mathrm{DL}_{k,m}$ on all of $\{0,1\}^m$.

We obtain Theorem 1.3 of the introduction by combining Theorem 3.1 and Lemma 4.9.
5 Generalizations and Limitations of Our Approach



We note that the approach we take is not specific to low-degree polynomials. Theorem 3.1 extends to
the case where the family $\mathcal{F}_Q$ is efficiently and uniformly approximated on $Y_Q$ by linear combinations
of functions from any efficiently computable set $S$. (In the case of polynomials, $S$ is the set of all
monomials of total degree at most $t$.) The properties we require from the function family $S$ are that
(1) it is relatively small (as it determines our running time and minimum database size), (2) for
every $x \in X$, we can efficiently compute $b_x \in [-T, T]^{S}$ such that
$\max_{y \in Y_Q} \left| f_x(y) - \sum_{s \in S} b_{x,s} \cdot s(y) \right| \le \gamma$, and
(3) $|s(y)|$ is small (say, at most $c$) for every $s \in S$ and $y \in Y_Q$. In the special case of approximation by
$m$-variate real polynomials of degree $t$ and norm $T$, we can take $S$ to be the set of monomials of total
degree at most $t$, thus $|S| = \binom{m+t}{t}$ and $c = 1$. If we have those parameters, then similarly to Theorem 3.1 we
can obtain an $\varepsilon$-differentially private sanitizer with running time $\mathrm{poly}(n, \log|X|, |S|, \log T, \log(1/\varepsilon))$
and accuracy $\alpha = \gamma + O(cT|S|^2\log(|S|/\beta)/\varepsilon n)$.
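
As a sanity check on the size of $S$ in the polynomial case, the short Python sketch below enumerates all monomials of total degree at most $t$ in $m$ variables and compares the count with $\binom{m+t}{t}$. Representing a monomial as a sorted tuple of variable indices is an assumption made for this example.

```python
from itertools import combinations_with_replacement
from math import comb

def monomials_up_to_degree(m, t):
    """All monomials in m variables of total degree at most t.

    Each monomial is a sorted tuple of variable indices (with repetition);
    the empty tuple is the constant monomial.
    """
    return [mono
            for degree in range(t + 1)
            for mono in combinations_with_replacement(range(m), degree)]

# Example: m = 5 variables, degree at most t = 3.
assert len(monomials_up_to_degree(5, 3)) == comb(5 + 3, 3)
```

Since the running time and the accuracy bound both depend polynomially on $|S|$, this count is the quantity that the degree bound $t$ controls.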

Unfortunately, it turns out that, for all of the query sets considered in this paper, there is not
much to be gained from considering more general function families $S$. Indeed, Klivans and Sherstov
[18, Theorem 1.1] show that even if $S$ consists of arbitrary functions $s_1, \ldots, s_\ell : \{0,1\}^k \to \mathbb{R}$ whose
linear combinations can uniformly approximate every monotone disjunction on $k$ variables to error
$\pm 1/3$, then $\ell \ge 2^{\Omega(\sqrt{k})}$. Note that, up to logarithmic factors, this matches the dependence on $k$ of
the upper bound of Theorem 4.3. Moreover, Sherstov [21, Theorem 8.1] broadly extends the result
of Klivans and Sherstov [18] beyond disjunctions, to pattern matrices of any Boolean function $f$ with high approximate
degree. Roughly speaking, the pattern matrix of $f$ corresponds to the query set consisting of
all restrictions of $f$, with some variables possibly negated (see [21] for a precise definition). In
particular, [21, Theorem 8.1] implies a function-family-independent lower bound for non-monotone
r-of-k queries; up to logarithmic factors, this lower bound matches the dependence on $r$ and $k$ of
the upper bound of Lemma 4.6.

We also note that our algorithm can be implemented in Kearns' statistical queries model. In
the statistical queries model, algorithms can only access the database through a statistical queries
oracle $\mathrm{STAT}(D, \tau)$ that takes as input a predicate $q : \{0,1\}^d \to [0,1]$ and returns a value $a$ such
that $\mathbb{E}_{x \in D}[q(x)] \in [a \pm \tau]$. It can be verified that our algorithm can be implemented using $\binom{m+t}{t}$
queries to $\mathrm{STAT}(D, \tau)$ for $\tau = 1/\left(\binom{m+t}{t} T^2\right)$, using one query to obtain each noisy coefficient (scaled
to the range $[0, 1]$). In the case of k-literal conjunctions, our algorithm makes $d^{O(\sqrt{k}\log(1/\alpha))}$ queries
and requires $\tau = 1/d^{O(\sqrt{k}\log(1/\alpha))}$.
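
The sketch below simulates a statistical queries oracle in Python and uses a single query to estimate the database average of one coefficient. It is only an illustration under stated assumptions: the oracle's error is modelled as uniform noise in $[-\tau, \tau]$, the affine rescaling of a coefficient from $[-T, T]$ to $[0, 1]$ is one plausible reading of "scaled to the range $[0, 1]$", and all function names are hypothetical.

```python
import random

def make_stat_oracle(database, tau, seed=0):
    """Simulated STAT(D, tau): answers within tau of the true average of q over D."""
    rng = random.Random(seed)

    def stat(q):
        true_mean = sum(q(x) for x in database) / len(database)
        return true_mean + rng.uniform(-tau, tau)

    return stat

def noisy_average_coefficient(stat, coefficient, T):
    """Estimate the database average of a coefficient in [-T, T] with one query."""
    def scaled(x):
        # Affine map from [-T, T] to [0, 1], so the query is a valid predicate.
        return (coefficient(x) + T) / (2 * T)

    # Undo the scaling; the resulting error is at most 2 * T * tau.
    return 2 * T * stat(scaled) - T
```

Undoing the rescaling multiplies the oracle's error by $2T$, which is one reason the tolerance $\tau$ has to shrink as the norm $T$ grows.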

Connections between differentially private data analysis and the statistical queries model have
been studied extensively [2, 16, 11]. Gupta et al. showed that lower bounds for agnostic learning
in the statistical queries model also imply lower bounds for data release in the statistical queries
model. Specifically, applying a result of Feldman [10], they show that for every $\alpha = \alpha(d) = o(1)$, for
$k = k(d) = \Theta(\log(1/\alpha(d))) = \omega(1)$, and for every constant $c > 0$, there is no algorithm that makes
$d^c$ queries to $\mathrm{STAT}(D, d^{-c})$, for $D \in (\{0,1\}^d)^*$, and releases $\alpha$-accurate answers to all monotone
k-way disjunction queries on $D$. Notice that our algorithm only makes a polynomial number of
statistical queries when both $k$ and $\alpha$ are constant.
A Polynomial Approximation of Decision Lists

Lemma A.1 (Lemma 4.9 restated, [17]). For every $k, m \in \mathbb{N}$ such that $k \le m$ and every $\gamma > 0$, the
family $\mathcal{P}_{t,T}$ of $m$-variate real polynomials of degree $\tilde{O}(\sqrt{k}\log(1/\gamma))$ and norm $T = m^{\tilde{O}(\sqrt{k}\log(1/\gamma))}$
efficiently and uniformly $\gamma$-approximates the family $\mathcal{F}_{Q_{\mathrm{DL}_{k,m}}} = \mathrm{DL}_{k,m}$ on all of $\{0,1\}^m$.

Proof. By Theorem 3.1 it is sufficient to show that if $f(y)$ is any length-k decision list over $\{0,1\}^m$,
then $f(y)$ can be $\gamma$-approximated by an explicit family of polynomials of degree $\tilde{O}(\sqrt{k}\log(1/\gamma))$
and norm $m^{\tilde{O}(\sqrt{k}\log(1/\gamma))}$. To this end, write $f(y)$ in the form, "if $\ell_1$ then output $b_1$ else ··· else
if $\ell_k$ then output $b_k$ else output $b_{k+1}$" where each $\ell_i$ is a boolean literal, and each $b_i$ is an output
bit in $\{0, 1\}$. Assume for notational convenience that $\ell_i = y_i$ for all $i$; the proof for general decision
lists is similar.

Following [17, Theorem 8], we may write

$$f(y) = b_1 y_1 + b_2 (1 - y_1) y_2 + \cdots + b_k (1 - y_1) \cdots (1 - y_{k-1}) y_k + b_{k+1} (1 - y_1)(1 - y_2) \cdots (1 - y_k).$$
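
The identity can be checked by brute force for small $k$. The Python sketch below does so under the same simplifying assumption as the proof, namely that the $i$-th literal is $y_i$; the function names are hypothetical.

```python
from itertools import product

def decision_list(b, y):
    """b has length k + 1: rule i outputs b[i] when y[i] == 1, else fall through to b[k]."""
    for i in range(len(b) - 1):
        if y[i] == 1:
            return b[i]
    return b[-1]

def sum_of_products(b, y):
    """Right-hand side of the identity: sum_i b_i * (1 - y_1) ... (1 - y_{i-1}) * y_i, plus the default term."""
    k = len(b) - 1
    total, prefix = 0, 1  # prefix = (1 - y_1) * ... * (1 - y_{i-1})
    for i in range(k):
        total += b[i] * prefix * y[i]
        prefix *= 1 - y[i]
    return total + b[k] * prefix

# Exhaustive check for k = 3.
for b in product([0, 1], repeat=4):
    for y in product([0, 1], repeat=3):
        assert decision_list(b, y) == sum_of_products(b, y)
```

Exactly one of the $k+1$ products is nonzero on any input, which is why the sum reproduces the output of the first rule that fires.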

At a high level, we treat each term of the above sum independently, using a transformation of
the Chebyshev polynomials to approximate each term within additive error $\gamma/k$. This ensures
that the sum of the resulting polynomials approximates $f(y)$ within additive error $\gamma$ as desired.
Details follow.

Let $g_k$ be the polynomial described in Lemma 4.2 with error parameter $\gamma' = \gamma/k$. Then the
polynomial $h_k(z) = 1 - g_k(k - z)$ satisfies the following properties:

1. The degree of $h_k$ is $t_k = O(\sqrt{k}\log(k/\gamma))$,

2. for every $i \in \{0, 1, \ldots, t_k\}$, the $i$-th coefficient, $c_i$, of $h_k$ has magnitude $|c_i| \le 2^{\tilde{O}(\sqrt{k}\log(1/\gamma))}$,

3. $h_k(k) = 1$, and

4. for every $z \in \{0, \ldots, k-1\}$, $|h_k(z)| \le \gamma/k$.

Moreover, $h_k$ can be constructed in time $\mathrm{poly}(k, \log(1/\gamma))$ (e.g. using linear programming, though
faster algorithms for constructing $h_k$ are known).
Consider the polynomial $p_x$ defined as

$$p_x(y) = b_1 \cdot h_k\big(y_1 + (k-1)\big) + b_2 \cdot h_k\big((1 - y_1) + y_2 + (k-2)\big) + \cdots$$
$$\qquad + b_k \cdot h_k\big((1 - y_1) + (1 - y_2) + \cdots + (1 - y_{k-1}) + y_k\big)$$
$$\qquad + b_{k+1} \cdot h_k\big((1 - y_1) + (1 - y_2) + \cdots + (1 - y_k)\big).$$

It is easily seen that $p_x$ has degree $O(\sqrt{k}\log(k/\gamma))$ and the absolute value of each of the
coefficients is at most $2^{\tilde{O}(\sqrt{k}\log(1/\gamma))}$. Moreover, $|p_x(y) - f(y)| \le \gamma$ for all $y \in \{0,1\}^m$. This
completes the proof.
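
The structure of $p_x$ can be illustrated with an exact stand-in for $h_k$. The Python sketch below replaces $h_k$ by the degree-$k$ interpolating polynomial that equals 1 at $z = k$ and 0 at $z = 0, \ldots, k-1$; this is an assumption made purely for illustration, since the proof uses a polynomial of much lower degree that is only approximately an indicator. With the exact stand-in, $p_x$ reproduces the decision list with zero error.

```python
from fractions import Fraction
from itertools import product

def exact_indicator(k, z):
    """Stand-in for h_k: equals 1 at z = k and 0 at z = 0, ..., k - 1 (degree k in z)."""
    value = Fraction(1)
    for j in range(k):
        value *= Fraction(z - j, k - j)
    return value

def decision_list(b, y):
    """b has length k + 1: rule i outputs b[i] when y[i] == 1, else fall through to b[k]."""
    for i in range(len(b) - 1):
        if y[i] == 1:
            return b[i]
    return b[-1]

def p(b, y):
    """p_x(y) from the proof, with exact_indicator in place of h_k."""
    k = len(b) - 1
    total = Fraction(0)
    for i in range(k):
        # Argument equals k exactly when y_1 = ... = y_i = 0 and y_{i+1} = 1 (1-indexed).
        arg = sum(1 - y[j] for j in range(i)) + y[i] + (k - 1 - i)
        total += b[i] * exact_indicator(k, arg)
    total += b[k] * exact_indicator(k, sum(1 - y[j] for j in range(k)))
    return total

# Exhaustive check for k = 3.
for b in product([0, 1], repeat=4):
    for y in product([0, 1], repeat=3):
        assert p(b, y) == decision_list(b, y)
```

Each argument passed to the indicator is an integer in $\{0, \ldots, k\}$ and reaches $k$ only for the rule that fires, so replacing the exact indicator by $h_k$ perturbs each term by at most $\gamma/k$.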

B Polynomial Approximation of r-of-k Queries

Lemma B.1 (Lemma 4.5 restated, [20]). For every $r, k \in \mathbb{N}$ such that $r \le k$ and $\gamma > 0$, there exists
a univariate polynomial $g_{r,k} : \mathbb{R} \to \mathbb{R}$ of degree $t_{r,k}$ such that $g_{r,k}(x) = \sum_{i=0}^{t_{r,k}} c_i x^i$ and:

1. $t_{r,k} = O\left(\sqrt{rk}\,\log(k) + \sqrt{k\log(1/\gamma)\log(k)}\right)$,

2. for every $i \in \{0, 1, \ldots, t_{r,k}\}$, $|c_i| \le 2^{\tilde{O}(\sqrt{kr\log(1/\gamma)})}$,

3. for every $x \in \{0, 1, \ldots, r-1\}$, $-\gamma \le g_{r,k}(x) \le \gamma$, and

4. for every $x \in \{r, \ldots, k\}$, $1 - \gamma \le g_{r,k}(x) \le 1 + \gamma$.

Moreover, $g_{r,k}$ can be constructed in time $\mathrm{poly}(k, r, \log(1/\gamma))$ (e.g. using linear programming).

We give details for the construction of [20, Lemma 3.11]. We do not prove the approximation
properties (which is done in [20]), but just confirm the sizes of the coefficients. The result is
trivial if $r = \Omega(k)$ or $\log(1/\gamma) = \Omega(k)$ since every symmetric function on $\{0, 1, \ldots, k\}$ has an exact
representation as a polynomial of degree $k$, so assume $r \le k/\log^2 k$ and $\log(1/\gamma) \le k/\log k$, with
$k \ge 2$.

Let $T_z$ be the degree-$z$ Chebyshev polynomial of the first kind (Fact B.2). We will use the
following well-known properties of Chebyshev polynomials.

Fact B.2. The Chebyshev polynomials of the first kind satisfy the following properties.

1. Each coefficient of $T_z$ has absolute value at most $3^z$.

2. $T_z(t) \ge 1$ for all $t \ge 1$.
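
Both properties are easy to check numerically for small degrees. The Python sketch below builds the integer coefficients of $T_z$ from the standard recurrence $T_{n+1}(t) = 2t\,T_n(t) - T_{n-1}(t)$; the function names are chosen only for this example, and the check is a finite spot check rather than a proof.

```python
def chebyshev_coefficients(z):
    """Integer coefficients of T_z, lowest degree first, via T_{n+1} = 2t*T_n - T_{n-1}."""
    prev, cur = [1], [0, 1]  # T_0 = 1, T_1 = t
    if z == 0:
        return prev
    for _ in range(z - 1):
        nxt = [0] + [2 * a for a in cur]  # 2t * T_n
        for i, a in enumerate(prev):      # minus T_{n-1}
            nxt[i] -= a
        prev, cur = cur, nxt
    return cur

def evaluate(coeffs, t):
    """Evaluate a polynomial with the given coefficients (lowest degree first) at t."""
    return sum(a * t ** i for i, a in enumerate(coeffs))

for z in range(16):
    coeffs = chebyshev_coefficients(z)
    assert max(abs(a) for a in coeffs) <= 3 ** z        # property 1
    assert all(evaluate(coeffs, t) >= 1 for t in (1, 2, 5))  # property 2 at integer points
```

The integer sample points keep the arithmetic exact; the second property holds for all real $t \ge 1$, but only a few points are tested here.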

Let A = [ 10 fog^ ] ; an d z = 3A[logfe]. The construction proceeds in several steps, with the 
final polynomial p defined in terms of multiple intermediate polynomials. 
1. For any fixed integer $0 \le \ell < r$, let

Pi,e{t) = T, 



All the coefficients of Pi/(t) are bounded in absolute value by 3 deg ( pM ) = 2 VV <+ A 
2. Define 



PiJt) -piAk - f) x 2 



The coefficients of p2 y i are bounded in absolute value by 



k O(deg( P2ii )) = k o(deg( Phl )) = 2 (,V^rJ. 

3. Define the polynomial 



P3,e(t) := T rs(d+1)(£+A) -| ( 1 + -777- A , 2 - P2(t) ) • 



Psi(t) has degree at most 



deg(p 3 ,£) = 22((i+l)VM^ + A)/A 

$= O\left(\sqrt{k(\ell + \Delta)}\,\log k\right) = O\left(\sqrt{k\ell}\,\log k + \sqrt{k \log k \log(1/\gamma)}\right).$



Noting that 64 ^ A ^ < 1, it is clear that the coefficients of P3/(t) are bounded in absolute 
value by k°^ p ^l 

4. Define the polynomial P4/(t) = • Since 

P3(fc-£) = Tp^^ (l + 6 4(£ + A)^ " °) ' 

by Part 2 of Fact B.2 the absolute values of the coefficients of $p_{4,\ell}$ are no larger than those
of $p_{3,\ell}$.

5. Define the univariate polynomial

$$q_1(t) = \prod_{i = -(\Delta-1), \ldots, (\Delta-1),\ i \ne 0} \big(t - (k - \ell - i)\big).$$

The coefficients of qi(t) have absolute value at most . 

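6. Define the polynomial

$$q_{2,\ell}(t) = \frac{1}{q_1(k - \ell)} \cdot p_{4,\ell}(t)\, q_1(t).$$

Noticing that $|q_1(k - \ell)| \ge 1$, it is clear that the degree of $q_{2,\ell}$ is $O(\Delta + \deg(p_{4,\ell}))$, and the
absolute values of the coefficients of $q_{2,\ell}(t)$ are at most $2^{\tilde{O}(\Delta + \deg(p_{3,\ell}))}$.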
7. Define the polynomial $q_{3,\ell}(t) = q_{2,\ell}(k - t)$. The absolute values of the coefficients of $q_{3,\ell}$ are
also bounded above by $2^{\tilde{O}(\Delta + \deg(p_{3,\ell}))}$.

Sherstov's arguments show that $q_{3,\ell}$ is a $\gamma/k$-approximation to the function

$$\mathrm{EXACT}_\ell(t) = \begin{cases} 1 & \text{if } t = \ell, \\ 0 & \text{for all other } t \in \{0, \ldots, k\}. \end{cases}$$



8. The final polynomial $p(t)$ is defined as

$$p(t) = 1 - \sum_{\ell \in \{0, 1, \ldots, r-1\}} q_{3,\ell}(t).$$

The coefficients of $p$ have absolute value at most $2^{\tilde{O}(\Delta + \deg(p_{3,r}))}$, and $p$ has the desired
degree. Moreover, $p(t)$ $\gamma$-approximates the function $\mathbb{1}_{t \ge r}(t)$ on the set $\{0, \ldots, k\}$, since the
$\ell$-th term in the sum $\sum_{\ell \in \{0, 1, \ldots, r-1\}} q_{3,\ell}(t)$ is a $\gamma/k$ approximation for the function $\mathrm{EXACT}_\ell(t)$.

Since $\log(1/\gamma) \le k/\log k$, it holds that $\Delta \le \sqrt{k \log(1/\gamma)}$. Thus, the coefficients of $p(t)$ have
absolute value at most $2^{\tilde{O}(\deg(p_{3,r}))} = 2^{\tilde{O}(\sqrt{kr} + \sqrt{k\log(1/\gamma)})}$.

□ 